[PATCH] util: fix max socket calculation #10

Closed
3d0c wants to merge 1 commit into NVIDIA:nvidia_stable-11.8 from 3d0c:fix/physical_package_id
3d0c commented Jan 3, 2026

Problem Description

Libvirt determines the number of CPU sockets by reading values from topology/physical_package_id. The existing logic implicitly assumes that these values are small, zero-based, and contiguous integers (0, 1, …, N). This assumption holds true on most systems.

However, on some platforms (for example, NVIDIA GB200), physical_package_id contains large, non-contiguous numeric identifiers such as 268435234 or 285212456. These values are identifiers, not indices.

The original implementation incorrectly treated the maximum numeric value of physical_package_id as the socket count. That value was then used as an upper bound in multiple code paths, including memory allocation and iteration logic. For example:

    for (i = 0; i < sock_max; i++)
        cores_maps[i] = virBitmapNew(0);

On affected systems, this resulted in loop bounds in the hundreds of millions, leading to:

  • excessive memory allocation
  • extremely large iteration counts
  • high CPU consumption
  • out-of-memory (OOM) conditions
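
To make the scale concrete, here is a minimal sketch of what a max-based bound implies for the example IDs quoted above. The helper names `max_package_id` and `pointer_array_bytes` are hypothetical and are not libvirt functions; the sketch also assumes the bound is simply the largest ID, and it counts only the pointer array itself, not the bitmaps allocated per slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not libvirt code): returns the largest package ID,
 * i.e. the value a max-based socket bound would be derived from. */
static unsigned long max_package_id(const unsigned long *ids, size_t n)
{
    unsigned long max = 0;

    assert(ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (ids[i] > max)
            max = ids[i];
    }
    return max;
}

/* Hypothetical helper: bytes needed for an array of `bound` pointers alone,
 * before any per-slot object is allocated. */
static unsigned long long pointer_array_bytes(unsigned long bound)
{
    return (unsigned long long)bound * sizeof(void *);
}
```

With the two IDs from the description (268435234 and 285212456), `max_package_id` returns 285212456. On a platform with 8-byte pointers, `pointer_array_bytes` then reports about 2.28 GB for the pointer array alone, even though only two sockets are present.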

On KubeVirt, this is triggered immediately by GetDomainStats and causes virtqemud to run out of memory:

bash-5.1$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
qemu           1  0.0  0.0 1751232    0 ?        Ssl  10:57   0:00 /usr/bin/virt-launcher-monitor --qemu-timeout 273s --name ubuntu-fat --uid a17a79bd-bdbd-4afd-9c98-42f3e525f7b0 --namespace default --kubevirt-share-dir /var/run/kubevirt
qemu          13  0.1  0.0 3752064 36864 ?       Sl   10:57   0:01 /usr/bin/virt-launcher --qemu-timeout 273s --name ubuntu-fat --uid a17a79bd-bdbd-4afd-9c98-42f3e525f7b0 --namespace default --kubevirt-share-dir /var/run/kubevirt --epheme
qemu          31 87.2  8.2 83683200 82460032 ?   Sl   10:57  14:24 /usr/sbin/virtqemud -f /var/run/libvirt/virtqemud.conf

The fix has been validated on:

  • libvirt 10.10
  • libvirt 11.9

This patch changes how the maximum socket count is calculated.

On some systems (e.g. GB200), physical_package_id values are not
contiguous or zero-based. Instead of 0..N, they may contain large
arbitrary identifiers (e.g. 256123234). The previous implementation
assumed a 0..N range and used the maximum ID value directly.

This caused:
    excessive memory allocation
    extremely large loop bounds
    OOM / DoS scenarios
    unnecessary CPU time consumption

The new implementation computes the socket count as the number of unique
package IDs present on the node, rather than relying on the maximum numeric
value.
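
The idea of counting unique package IDs can be sketched as follows. This is an illustration only, not the code from this patch or from the upstream commit: the function name `count_unique_package_ids` is hypothetical, and it assumes the IDs have already been read from topology/physical_package_id into an array, one entry per CPU.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for unsigned long values; avoids overflow from subtraction. */
static int cmp_ulong(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;

    return (x > y) - (x < y);
}

/* Hypothetical helper (not libvirt code): returns the number of distinct
 * package IDs in ids[0..n), or -1 if the scratch copy cannot be allocated. */
static long count_unique_package_ids(const unsigned long *ids, size_t n)
{
    unsigned long *sorted;
    long unique = 0;

    assert(ids != NULL || n == 0);
    if (n == 0)
        return 0;

    /* Sort a copy so the caller's per-CPU order is left untouched. */
    sorted = malloc(n * sizeof(*sorted));
    if (!sorted)
        return -1;
    memcpy(sorted, ids, n * sizeof(*sorted));
    qsort(sorted, n, sizeof(*sorted), cmp_ulong);

    /* After sorting, each change of value marks a new distinct ID. */
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || sorted[i] != sorted[i - 1])
            unique++;
    }

    free(sorted);
    return unique;
}
```

For four CPUs reporting the IDs 268435234, 268435234, 285212456, 285212456, this returns 2, the same result as for a conventional 0, 0, 1, 1 layout. Code that indexes per-socket arrays would presumably also need a mapping from each raw ID to a dense index in 0..count-1; that step is outside this sketch.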

3d0c force-pushed the fix/physical_package_id branch from 824c4b0 to df3a519 on January 4, 2026 12:07

NathanChenNVIDIA (Collaborator) commented

Closing, as we are picking up the upstream commit for this fix in the following PR:

#11
